Goto

Collaborating Authors

 expression recognition





CausalAffect: Causal Discovery for Facial Affective Understanding

Hu, Guanyu, Lian, Tangzheng, Kollias, Dimitrios, Celiktutan, Oya, Yang, Xinyu

arXiv.org Artificial Intelligence

Understanding human affect from facial behavior requires not only accurate recognition but also structured reasoning over the latent dependencies that drive muscle activations and their expressive outcomes. Although Action Units (AUs) have long served as the foundation of affective computing, existing approaches rarely address how to infer psychologically plausible causal relations between AUs and expressions directly from data. We propose CausalAffect, the first framework for causal graph discovery in facial affect analysis. CausalAffect models AU-AU and AU-Expression dependencies through a two-level polarity and direction aware causal hierarchy that integrates population-level regularities with sample-adaptive structures. A feature-level counterfactual intervention mechanism further enforces true causal effects while suppressing spurious correlations. Crucially, our approach requires neither jointly annotated datasets nor handcrafted causal priors, yet it recovers causal structures consistent with established psychological theories while revealing novel inhibitory and previously uncharacterized dependencies. Extensive experiments across six benchmarks demonstrate that CausalAffect advances the state of the art in both AU detection and expression recognition, establishing a principled connection between causal discovery and interpretable facial behavior. All trained models and source code will be released upon acceptance.


Graph-Attention Network with Adversarial Domain Alignment for Robust Cross-Domain Facial Expression Recognition

Ghaedi, Razieh, BabaAhmadi, AmirReza, Zwiggelaar, Reyer, Fan, Xinqi, Alam, Nashid

arXiv.org Artificial Intelligence

Cross-domain facial expression recognition (CD-FER) remains difficult due to severe domain shift between training and deployment data. We propose Graph-Attention Network with Adversarial Domain Alignment (GAT-ADA), a hybrid framework that couples a ResNet-50 as backbone with a batch-level Graph Attention Network (GAT) to model inter-sample relations under shift. Each mini-batch is cast as a sparse ring graph so that attention aggregates cross-sample cues that are informative for adaptation. To align distributions, GAT-ADA combines adversarial learning via a Gradient Reversal Layer (GRL) with statistical alignment using CORAL and MMD. GAT-ADA is evaluated under a standard unsupervised domain adaptation protocol: training on one labeled source (RAF-DB) and adapting to multiple unlabeled targets (CK+, JAFFE, SFEW 2.0, FER2013, and ExpW). GAT-ADA attains 74.39% mean cross-domain accuracy. On RAF-DB to FER2013, it reaches 98.0% accuracy, corresponding to approximately a 36-point improvement over the best baseline we re-implemented with the same backbone and preprocessing.


Reading Smiles: Proxy Bias in Foundation Models for Facial Emotion Recognition

Tsangko, Iosif, Triantafyllopoulos, Andreas, Abdelmoula, Adem, Mallol-Ragolta, Adria, Schuller, Bjoern W.

arXiv.org Artificial Intelligence

--Foundation Models (FMs) are rapidly transforming Affective Computing (AC), with Vision-Language Models (VLMs) now capable of recognising emotions in zero-shot settings. This paper probes a critical but underexplored question: what visual cues do these models rely on to infer affect, and are these cues psychologically grounded or superficially learnt? We benchmark varying scale VLMs on a teeth-annotated subset of AffectNet dataset and find consistent performance shifts depending on the presence of visible teeth. Through structured introspection of -the best-performing model, i.e., GPT -4o, we show that facial attributes like eyebrow position drive much of its affective reasoning, revealing a high degree of internal consistency in its valence-arousal predictions. These patterns highlight the emergent nature of FMs behaviour, but also reveal risks: shortcut learning, bias, and fairness issues--especially in sensitive domains like mental health and education. Understanding and interpreting human emotions is fundamental to social interaction. From early developmental cues in infants, to high-stakes decision-making in adults, facial expressions serve as a primary channel for conveying affect.


ARPGNet: Appearance- and Relation-aware Parallel Graph Attention Fusion Network for Facial Expression Recognition

Li, Yan, Zhao, Yong, Xia, Xiaohan, Jiang, Dongmei

arXiv.org Artificial Intelligence

The key to facial expression recognition is to learn discriminative spatial-temporal representations that embed facial expression dynamics. Previous studies predominantly rely on pre-trained Convolutional Neural Networks (CNNs) to learn facial appearance representations, overlooking the relationships between facial regions. To address this issue, this paper presents an Appearance- and Relation-aware Parallel Graph attention fusion Network (ARPGNet) to learn mutually enhanced spatial-temporal representations of appearance and relation information. Specifically, we construct a facial region relation graph and leverage the graph attention mechanism to model the relationships between facial regions. The resulting relational representation sequences, along with CNN-based appearance representation sequences, are then fed into a parallel graph attention fusion module for mutual interaction and enhancement. This module simultaneously explores the complementarity between different representation sequences and the temporal dynamics within each sequence. Experimental results on three facial expression recognition datasets demonstrate that the proposed ARPGNet outperforms or is comparable to state-of-the-art methods.


Modular Deep Learning Framework for Assistive Perception: Gaze, Affect, and Speaker Identification

Anchan, Akshit Pramod, Thomas, Jewelith, Roy, Sritama

arXiv.org Artificial Intelligence

Developing comprehensive assistive technologies requires the seamless integration of visual and auditory perception. This research evaluates the feasibility of a modular architecture inspired by core functionalities of perceptive systems like 'Smart Eye.' We propose and benchmark three independent sensing modules: a Convolutional Neural Network (CNN) for eye state detection (drowsiness/attention), a deep CNN for facial expression recognition, and a Long Short-Term Memory (LSTM) network for voice-based speaker identification. Utilizing the Eyes Image, FER2013, and customized audio datasets, our models achieved accuracies of 93.0%, 97.8%, and 96.89%, respectively. This study demonstrates that lightweight, domain-specific models can achieve high fidelity on discrete tasks, establishing a validated foundation for future real-time, multimodal integration in resource-constrained assistive devices.


Text-guided Weakly Supervised Framework for Dynamic Facial Expression Recognition

Jung, Gunho, Kong, Heejo, Lee, Seong-Whan

arXiv.org Artificial Intelligence

Dynamic facial expression recognition (DFER) aims to identify emotional states by modeling the temporal changes in facial movements across video sequences. A key challenge in DFER is the many-to-one labeling problem, where a video composed of numerous frames is assigned a single emotion label. A common strategy to mitigate this issue is to formulate DFER as a Multiple Instance Learning (MIL) problem. However, MIL-based approaches inherently suffer from the visual diversity of emotional expressions and the complexity of temporal dynamics. To address this challenge, we propose TG-DFER, a text-guided weakly supervised framework that enhances MILbased DFER by incorporating semantic guidance and coherent temporal modeling. We incorporate a vision-language pre-trained (VLP) model is integrated to provide semantic guidance through fine-grained textual descriptions of emotional context. Furthermore, we introduce visual prompts, which align enriched textual emotion labels with visual instance features, enabling fine-grained reasoning and frame-level relevance estimation. In addition, a multi-grained temporal network is designed to jointly capture short-term facial dynamics and long-range emotional flow, ensuring coherent affective understanding across time. Extensive results demonstrate that TG-DFER achieves improved generalization, interpretability, and temporal sensitivity under weak supervision. Introduction Facial expressions serve as direct indicators of human emotions, playing a crucial role in interpreting feelings during human interactions [4, 5]. Recognizing these expressions is essential in various fields, including human-computer interaction (HCI) [1], health assessment [2], and driver assistance systems [3].


Elements of Active Continuous Learning and Uncertainty Self-Awareness: a Narrow Implementation for Face and Facial Expression Recognition

Selitskiy, Stanislav

arXiv.org Artificial Intelligence

Reflection on one's thought process and making corrections to it if there exists dissatisfaction in its performance is, perhaps, one of the essential traits of intelligence. However, such high-level abstract concepts mandatory for Artificial General Intelligence can be modelled even at the low level of narrow Machine Learning algorithms. Here, we present the self-awareness mechanism emulation in the form of a supervising artificial neural network (ANN) observing patterns in activations of another underlying ANN in a search for indications of the high uncertainty of the underlying ANN and, therefore, the trustworthiness of its predictions. The underlying ANN is a convolutional neural network (CNN) ensemble employed for face recognition and facial expression tasks. The self-awareness ANN has a memory region where its past performance information is stored, and its learnable parameters are adjusted during the training to optimize the performance. The trustworthiness verdict triggers the active learning mode, giving elements of agency to the machine learning algorithm that asks for human help in high uncertainty and confusion conditions.